Wine Quality DataSet Prediction¶

Group 8: Zoey Ma, Erin Dougall, Jack Rong

Introduction:¶

  • Having exceptional wine taste preferences has become a very revered skill over time, that few people make a career out of. Determining the 'quality' of a wine is based on human preference but these preferences are often influenced by physicochemical and sensory variables (Cortez, P., et al, 2009). We want to see if we can create a more data-driven approach to the classification of wine quality. Similar models have been created and their systems ranked the wines very similarly to experts (Petropoulos, S., et al, 2017).

  • Our model will answer the question, what quality ranking will a wine receive based on its pH and alcohol levels?

  • The data set we will be using is the ‘Wine Quality Data Set’ found on UCI and created by researchers at the University of Minho in Portugal. The data set focuses on red Portuguese ‘Vinho Verde’ wines. It has input variables based on physicochemical tests such as acidity, pH, alcohol level, etc. which all lead to the output of a quality score from 0-10.

  • Input variables (based on physicochemical tests):

  1. fixed acidity: most acids involved with wine or fixed or nonvolatile
  2. volatile acidity: the amount of acetic acid in wine
  3. citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines
  4. residual sugar: the amount of sugar remaining after fermentation stops
  5. chlorides: the amount of salt in the wine
  6. free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion
  7. total sulfur dioxide: amount of free and bound forms of S02
  8. density: the density of water is close to that of water depending on the percent alcohol and sugar content
  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic)
  10. sulphates: a wine additive which can contribute to sulfur dioxide gas (S02) levels
  11. alcohol: the percent alcohol content of the wine
  • Output variable (based on sensory data):
  1. quality: output variable (based on sensory data, score between 0 and 10)

Preliminary exploratory data analysis:¶

In [256]:
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import seaborn as sns
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [269]:
wine_quality_data = pd.read_csv("winequality-red.csv",sep=";")#.columns.str.replace(' ', '_')
wine_quality_data.columns = wine_quality_data.columns.str.replace(' ','_',regex=True)
wine_quality_data
Out[269]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol quality
0 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5
1 7.8 0.880 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68 9.8 5
2 7.8 0.760 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65 9.8 5
3 11.2 0.280 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.8 6
4 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5
... ... ... ... ... ... ... ... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58 10.5 5
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.2 6
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 6
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 5
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 6

1599 rows × 12 columns

In [258]:
wine_quality_data.isnull().sum()

# No null values present in data frame.
Out[258]:
fixed_acidity           0
volatile_acidity        0
citric_acid             0
residual_sugar          0
chlorides               0
free_sulfur_dioxide     0
total_sulfur_dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
In [259]:
variables = wine_quality_data.columns.values
print(variables)
['fixed_acidity' 'volatile_acidity' 'citric_acid' 'residual_sugar'
 'chlorides' 'free_sulfur_dioxide' 'total_sulfur_dioxide' 'density' 'pH'
 'sulphates' 'alcohol' 'quality']
In [260]:
#Here we can say most of wines are between 5 to 6 range which is average. 3 is the lowest qulity and 8 is the highest.

wine_quality_data['quality'].value_counts(normalize = True)
Out[260]:
5    0.425891
6    0.398999
7    0.124453
4    0.033146
8    0.011257
3    0.006254
Name: quality, dtype: float64

Methods:¶

  • We will use a KNN classifier to predict the wine quality using the volatile acidity and citric_acid columns since they are two common factors that contribute to the wine quality. It's also the most useful features as presented in our predictor box plots below. Although the wine quality is a numeric quantity in the dataset, we will use a classifier rather than regression since the quality is actually an ordinal variable rating (integers from 0-10) so we will treat it as a class/category. We will split it into three labels, which are poor (quality from 0 - 4),normal (quality from 5 -6), and excellent (quality from 7 to 10). We will find the best k-value between 1-100 using cross-validation and grid search with the training set. After doing so, we will use the best k-value to build a model on the entire training set and use it to predict on the test set to determine our classifier's accuracy.

  • As an intermediate step, we can call the best_score_ and best_accuracy methods which van give us the best accuracy during cross-validation with grid search using mean test score. This will allow us to easily determine the best k-value as well as see how the accuracy changes with different k-values.

  • We will visualize our results using a confusion matrix to see when and how many times we have predicted the correct label vs. the incorrect label.

We decided to create a correlation map to find the strength and direction of the relationship between variables. The coefficient ranges from -1 to 1, with a value of 0 indicating no correlation, a positive value indicating a positive correlation, and a negative value indicating a negative correlation.

In [252]:
# Correlation map 
corr_matrix = wine_quality_data.corr()

# Create a heatmap of the correlations
sns.heatmap(corr_matrix, annot=True, cmap="YlGnBu")
plt.title('Correlation Map of Red Wine Quality')

# Display the heatmap
plt.show()
In [270]:
# We decided to split quality into 3 labels, which are poor (quality from 0 - 4),normal (quality from 5 -6),
# and excellent (quality from 7 to 10)
bins = [0, 4, 6, 10]
labels = ["poor","normal","excellent"]
wine_quality_data['quality_label'] = pd.cut(wine_quality_data['quality'], bins=bins, labels=labels)
wine_quality_data.drop('quality',axis =1, inplace = True)
wine_quality_data.head()
Out[270]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol quality_label
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 normal
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 normal
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 normal
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 normal
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 normal
In [262]:
#Pairplot: to better determine the relationships between pairs of variables in a dataset.
#we decided to create pairlots for our red wine quality 
pd.plotting.scatter_matrix(wine_quality_data, figsize=(20, 20))

# Display the pair plot
plt.show()

Then, we will create box plot for these two variables below to compare the distribution of them across different red wine labels.

In [280]:
#Box plot for volatile acidity
sns.histplot(data=wine_quality_data,x="volatile_acidity",hue="quality_label",kde=True)
plt.show()
In [281]:
#Box plot for volatile acidity
sns.histplot(data=wine_quality_data,x="citric_acid",hue="quality_label",kde=True)
plt.show()

To determine wine qulity, volatile acidity and citric acid can be our important features as the overlap of their distributions are vary less in comparison to others.

In [161]:
#Splitting the dataset
wine_train, wine_test = train_test_split(
    wine_quality_data, train_size = 0.75
)
wine_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1199 entries, 20 to 1440
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   fixed_acidity         1199 non-null   float64 
 1   volatile_acidity      1199 non-null   float64 
 2   citric_acid           1199 non-null   float64 
 3   residual_sugar        1199 non-null   float64 
 4   chlorides             1199 non-null   float64 
 5   free_sulfur_dioxide   1199 non-null   float64 
 6   total_sulfur_dioxide  1199 non-null   float64 
 7   density               1199 non-null   float64 
 8   pH                    1199 non-null   float64 
 9   sulphates             1199 non-null   float64 
 10  alcohol               1199 non-null   float64 
 11  quality_label         1199 non-null   category
dtypes: category(1), float64(11)
memory usage: 113.7 KB
In [162]:
wine_test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 21 to 1266
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   fixed_acidity         400 non-null    float64 
 1   volatile_acidity      400 non-null    float64 
 2   citric_acid           400 non-null    float64 
 3   residual_sugar        400 non-null    float64 
 4   chlorides             400 non-null    float64 
 5   free_sulfur_dioxide   400 non-null    float64 
 6   total_sulfur_dioxide  400 non-null    float64 
 7   density               400 non-null    float64 
 8   pH                    400 non-null    float64 
 9   sulphates             400 non-null    float64 
 10  alcohol               400 non-null    float64 
 11  quality_label         400 non-null    category
dtypes: category(1), float64(11)
memory usage: 38.0 KB
In [164]:
#Calculate the counts of each quality appear in the training dataset
wine_train['quality_label'].value_counts(normalize = True)
Out[164]:
normal       0.818182
excellent    0.144287
poor         0.037531
Name: quality_label, dtype: float64
In [166]:
# This shows that there is 0 rows that has missing data in the training dataset 
wine_train.isnull().sum()
Out[166]:
fixed_acidity           0
volatile_acidity        0
citric_acid             0
residual_sugar          0
chlorides               0
free_sulfur_dioxide     0
total_sulfur_dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality_label           0
dtype: int64
In [168]:
#This table shows the mean value for each predictor variable at each wine wuality level 
mean_sum_table = pd.DataFrame()
wine_vars = ['fixed_acidity','volatile_acidity','citric_acid',
               'residual_sugar','chlorides','free_sulfur_dioxide',
               'total_sulfur_dioxide','density','pH','sulphates','alcohol']
for var in wine_vars: 
    mean_sum_table['mean_',var] = wine_train.groupby(wine_train['quality_label'])[var].mean()
    
#remove the '(' and ')' in colomn names
new_names = {col: var for col, var in zip(mean_sum_table.columns, wine_vars)}
mean_sum_table = mean_sum_table.rename(columns=new_names)
mean_sum_table
Out[168]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol
quality_label
poor 8.040000 0.688889 0.174000 2.782222 0.091778 11.755556 34.533333 0.996834 3.381556 0.592444 10.226667
normal 8.271254 0.536473 0.261539 2.533384 0.090384 16.633537 49.683996 0.996890 3.311335 0.648787 10.263439
excellent 8.875723 0.408353 0.375029 2.713873 0.076942 14.043353 34.283237 0.996030 3.290231 0.749942 11.533333

Training Data Visualization

In [169]:
# To see how data is distributed for every column: we create distribution plots for each of the predictor varibles 

wine_vars = ['fixed_acidity','volatile_acidity','citric_acid',
               'residual_sugar','chlorides','free_sulfur_dioxide',
               'total_sulfur_dioxide','density','pH','sulphates','alcohol']




var_plots = [] 
for var in wine_vars: 
    var_plot = (
    alt.Chart(wine_train)
    .mark_bar()
    .encode( x= alt.X(var, title = (var.replace('_', ' '), "value")),
            y=alt.Y("count()", title = ("density")),
            opacity=alt.value(0.5),
            color = alt.value('purple')
           )
    
)
    var_plots.append(var_plot) 

for var_plot in var_plots:
    var_plot.display() 
/opt/conda/lib/python3.10/site-packages/altair/utils/core.py:317: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for col_name, dtype in df.dtypes.iteritems():

Data Analysis:¶

In [223]:
wine_concav = (
    alt.Chart(wine_quality_data)
    .mark_circle()
    .encode(
        x="volatile_acidity",
        y="citric_acid",
        color= alt.Color("quality_label"))
)
wine_concav
Out[223]:
In [221]:
#Preprocess the data:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
 #Preprocess the data:
wine_preprocessor = make_column_transformer (
    (StandardScaler(), ["volatile_acidity","citric_acid"]),
)

#train the classifier
knn = KNeighborsClassifier(n_neighbors = 2)

X = wine_train.loc[:,["volatile_acidity","citric_acid"]]
y = wine_train["quality_label"]
X_train_sc= X.to_numpy()
y_train_sc= y.to_numpy()

X_test = wine_test.loc[:,["volatile_acidity","citric_acid"]]
y_test = wine_test["quality_label"]
X_test_sc= X.to_numpy()
y_test_sc= y.to_numpy()


knn_fit = make_pipeline(wine_preprocessor,knn).fit(X,y)

wine_test_predictions = wine_test.assign(
    predicted = knn_fit.predict(wine_test.loc[:,["volatile_acidity","citric_acid"]])
)

#wine_test_predictions
wine_test_predictions[['quality_label','predicted']]

correct_preds = wine_test_predictions[
    wine_test_predictions['quality_label'] == wine_test_predictions
]
correct_preds.shape[0] / wine_test_predictions.shape[0]

knn_fit
/tmp/ipykernel_133/2660049447.py:36: FutureWarning: Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version. Do `left, right = left.align(right, axis=1, copy=False)` before e.g. `left == right`
  wine_test_predictions['quality_label'] == wine_test_predictions
Out[221]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['volatile_acidity',
                                                   'citric_acid'])])),
                ('kneighborsclassifier', KNeighborsClassifier(n_neighbors=2))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['volatile_acidity',
                                                   'citric_acid'])])),
                ('kneighborsclassifier', KNeighborsClassifier(n_neighbors=2))])
ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                 ['volatile_acidity', 'citric_acid'])])
['volatile_acidity', 'citric_acid']
StandardScaler()
KNeighborsClassifier(n_neighbors=2)
In [213]:
wine_acc1 = knn_fit.score(
    wine_test.loc[:,["volatile_acidity","citric_acid"]],
    wine_test['quality_label']
)
wine_acc1
Out[213]:
0.7325

The accuracy with K=2 is 73.25%.

In [227]:
# Parameter value selection
knn = KNeighborsClassifier() 
wine_tune_pipe = make_pipeline(wine_preprocessor,knn)
parameter_grid = {
    "kneighborsclassifier__n_neighbors":range(1,100,2),
}

wine_tune_grid = GridSearchCV(
    estimator = wine_tune_pipe,
    param_grid = parameter_grid,
    cv=5,
    n_jobs=-1
)

gs_results = gs.fit(X_train_sc, y_train_sc)


accuracies_grid = pd.DataFrame(
    wine_tune_grid
    .fit(wine_train.loc[:,["volatile_acidity","citric_acid"]],
        wine_train["quality_label"]
        ).cv_results_)

accuracies_grid = accuracies_grid[["param_kneighborsclassifier__n_neighbors", "mean_test_score", "std_test_score"]
              ].assign(
                  sem_test_score = accuracies_grid["std_test_score"] / 5**(1/2)
              ).rename(
                  columns = {"param_kneighborsclassifier__n_neighbors" : "n_neighbors"}
              ).drop(
                  columns = ["std_test_score"]
              )
accuracies_grid.head()
Out[227]:
n_neighbors mean_test_score sem_test_score
0 1 0.763152 0.007488
1 3 0.779801 0.005455
2 5 0.786475 0.005927
3 7 0.798992 0.005868
4 9 0.800649 0.009397
In [228]:
print('Best Accuracy: ', gs_results.best_score_)
print('Best Parametrs: ', gs_results.best_params_)
Best Accuracy:  0.8223535564853556
Best Parametrs:  {'n_neighbors': 43}

From above, we can say that our best model is with K = 43. To observe its accuracy, we can re-perform the fitting to output the new accuracy.

In [217]:
knn_43 = KNeighborsClassifier(n_neighbors = 43)
knn_fit_43 = make_pipeline(wine_preprocessor,knn_11).fit(X,y)
wine_acc = knn_fit_11.score(
    wine_test.loc[:,["volatile_acidity","citric_acid"]],
    wine_test['quality_label']
)
wine_acc
Out[217]:
0.8475
In [218]:
#Confusion matrix for KNN 
print(pd.DataFrame(y_test)['quality_label'].value_counts())
normal       338
excellent     44
poor          18
Name: quality_label, dtype: int64
In [219]:
pd.crosstab(
    wine_test_predictions['quality_label'],
    wine_test_predictions['predicted']
)
Out[219]:
predicted excellent normal poor
quality_label
poor 0 17 1
normal 75 263 0
excellent 29 15 0
In [ ]:
 

Expected Outcomes and Significance:¶

  • Our expected outcome from this strategy is to have a model that can predict the quality of a Portuguese “Vinho Verde” as accurately similarily to wine experts as possible.

  • Using a data mining approach to classifying wine qualities could have a huge significance to the wine industry. When it is time for new wines to be certified, many countries require by law for the sensory analysis to be done by human testers. However, all testers have their own unique experience and thus their analysis is inherently biased. Our approach to classification remains objective. Some researchers suggest these data-driven approaches could aid in the efficiency of wine evaluation; for example, an expert has to repeat their evaluation only if there is a significant difference between their classification and the model’s (Cortez, P., et al, 2009). Looking to the future, could classification models like ours aid new winemakers in legitimizing their products without the need for expensive evaluations?

Citations:¶

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J. (2009). Modelling wine preference by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553. https://doi.org/10.1016/j.dss.2009.05.016

Petropoulos, S., Karavas, C. S., Balafoutis, A. T., Paraskevopoulos, I., Kallithraka, S., Kotseridis, Y. (2017). Fuzzy logic tool for wine quality classification. Computers and Electronics in Agriculture, 142(Part B), 552-562. https://doi.org/10.1016/j.compag.2017.11.015

In [ ]:
 
In [ ]: